library(tidyverse) # for graphing and data cleaning
library(googlesheets4) # for reading googlesheet data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
gs4_deauth() # To not have to authorize each time you knit.
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
  mutate(date = ymd(date))
# Seeds/plants (and other garden supply) costs
supply_costs <- read_sheet("https://docs.google.com/spreadsheets/d/1dPVHwZgR9BxpigbHLnA0U99TtVHHQtUzNB9UR0wvb7o/edit?usp=sharing",
                           col_types = "ccccnn")
# Planting dates and locations
plant_date_loc <- read_sheet("https://docs.google.com/spreadsheets/d/11YH0NtXQTncQbUse5wOsTtLSKAiNogjUA21jnX5Pnl4/edit?usp=sharing",
                             col_types = "cccnDlc") %>%
  mutate(date = ymd(date))
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instructions (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
Set keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit the changes (remember to include a commit message), and push them (green up arrow). Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
Use the garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week. Display the results so that the vegetables are rows but the days of the week are columns.

garden_harvest %>%
  mutate(week_day = wday(date, label = TRUE),
         wt_lbs = weight * 0.00220462) %>%
  group_by(vegetable, week_day) %>%
  summarize(daily_wt_lbs = sum(wt_lbs)) %>%
  pivot_wider(id_cols = vegetable,
              names_from = week_day,
              values_from = daily_wt_lbs)
Use the garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot variable from the plant_date_loc table. This will not turn out perfectly. What is the problem? How might you fix it?

garden_harvest %>%
  mutate(wt_lbs = weight * 0.00220462) %>%
  group_by(vegetable, variety, date) %>%
  summarise(tot_wt_lbs = sum(wt_lbs)) %>%
  left_join(select(plant_date_loc, vegetable, variety, plot),
            by = c("vegetable", "variety"))
The problem is that there are a few NA values in the plot variable after the join, because some varieties in garden_harvest never appear in plant_date_loc (and varieties planted in more than one plot get a row for each plot). You could potentially fix the missing values with a function like replace_na(); however, that might change the meaning of the data table.
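As a minimal sketch of that kind of fix (the tables below are tiny made-up stand-ins for the real harvest summary and plant_date_loc; replace_na() comes from tidyr, which loads with tidyverse):

```r
library(tidyverse)

# Toy stand-ins: one variety has a known plot, the other does not
harvest_sum <- tibble(vegetable  = c("beans", "kale"),
                      variety    = c("bush", "curly"),
                      tot_wt_lbs = c(2.5, 1.1))
plots <- tibble(vegetable = "beans", variety = "bush", plot = "M")

# After the left join, kale's plot is NA; replace_na() fills it
# with an explicit label instead of leaving it missing
harvest_sum %>%
  left_join(plots, by = c("vegetable", "variety")) %>%
  mutate(plot = replace_na(plot, "unknown"))
```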
Use the garden_harvest and supply_costs datasets, along with price data from somewhere like this, to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.

First, calculate how many pounds of each vegetable and variety were harvested: use mutate() to change grams to pounds, group_by() vegetable and variety, and then summarize() to find the total weight harvested of each vegetable and variety. (See part of the code above for this step.) Next, left-join this summary with the supply_costs dataset by vegetable and variety, and use mutate() to create a new variable named cost_veggies_grown: the price with tax of the seeds/supplies for each vegetable and variety. Then take the per-pound prices from Whole Foods for each type of vegetable and variety, multiply these prices by the total weight harvested for each vegetable, and create a new variable called whole_foods_price. Finally, use mutate() to create a new variable called savings: whole_foods_price minus cost_veggies_grown. The savings variable shows the money saved (or, if negative, the money lost) by growing the vegetables instead of buying them.
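A rough sketch of that plan, again on tiny made-up stand-ins for the real tables; in particular, the whole_foods_prices table and every number in it are hypothetical (the real per-pound prices would have to be looked up by hand), and the column names are assumptions:

```r
library(tidyverse)

# Toy stand-ins for garden_harvest (weight in grams) and supply_costs
harvest <- tibble(vegetable = c("beans", "beans", "kale"),
                  variety   = c("bush", "bush", "curly"),
                  weight    = c(500, 800, 300))
supplies <- tibble(vegetable = c("beans", "kale"),
                   variety   = c("bush", "curly"),
                   price_with_tax = c(3.50, 4.25))

# Hypothetical store prices per pound
whole_foods_prices <- tibble(vegetable    = c("beans", "kale"),
                             variety      = c("bush", "curly"),
                             price_per_lb = c(2.99, 3.49))

harvest %>%
  group_by(vegetable, variety) %>%
  summarize(tot_wt_lbs = sum(weight * 0.00220462), .groups = "drop") %>%
  left_join(supplies, by = c("vegetable", "variety")) %>%
  left_join(whole_foods_prices, by = c("vegetable", "variety")) %>%
  mutate(whole_foods_price = tot_wt_lbs * price_per_lb,
         savings = whole_foods_price - price_with_tax)
```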
tomatoes <- garden_harvest %>%
  mutate(wt_lbs = weight * 0.00220462) %>%
  filter(vegetable == "tomatoes") %>%
  group_by(variety) %>%
  mutate(first_harvest = min(date)) %>%
  arrange(first_harvest)

ggplot(tomatoes, aes(x = reorder(str_to_title(variety), first_harvest), y = wt_lbs)) +
  geom_col(fill = "red") +
  theme_minimal() +
  labs(x = "Variety", y = "Total Weight of Harvest (lbs)",
       title = "Total Harvest of Tomatoes by Variety") +
  coord_flip()
In the garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().

garden_harvest %>%
  mutate(varieties_lower = str_to_lower(variety),
         length_name = str_length(varieties_lower)) %>%
  distinct(vegetable, varieties_lower, length_name) %>%
  arrange(vegetable, length_name)
In the garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().

garden_harvest %>%
  mutate(variety_lower = str_to_lower(variety)) %>%
  filter(str_detect(variety_lower, "er|ar")) %>%
  distinct(variety_lower)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:

- Trips contains records of individual rentals
- Stations gives the locations of the bike rental stations

Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
  "https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"

Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
Make a density plot of the events versus sdate. Use geom_density().

Trips %>%
  ggplot(aes(x = sdate)) +
  geom_density(fill = "blue") +
  labs(x = "Start Date of the Rentals", y = "Density",
       title = "Distribution of Events by Start Date of Rentals")
The busiest time for rentals is during the earlier months (October and November); as winter approaches, the number of rentals decreases.
Use mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.

Trips %>%
  mutate(time_hour = hour(sdate),
         time_min = minute(sdate),
         new_time = time_hour + time_min / 60) %>%
  ggplot(aes(x = new_time)) +
  geom_density(fill = "orange") +
  labs(x = "Time of Day", y = "Density",
       title = "Distribution of Events by Time of Day")
The most popular times of day to rent bikes are the morning and evening rush hours; however, the evening rush hour is busier than the morning one.
Trips %>%
  mutate(week_day = wday(sdate, label = TRUE)) %>%
  ggplot(aes(y = week_day)) +
  geom_bar(fill = "orange") +
  labs(x = "Events", y = "Day of the Week",
       title = "Distribution of Events by Day of the Week")
The busiest day to rent bikes appears to be Friday, closely followed by Monday.
Trips %>%
  mutate(time_hour = hour(sdate),
         time_min = minute(sdate),
         new_time = time_hour + time_min / 60,
         week_days = wday(sdate, label = TRUE)) %>%
  ggplot(aes(x = new_time)) +
  geom_density(fill = "orange") +
  labs(x = "Time of Day", y = "Density",
       title = "Distribution of Events by Time of Day") +
  facet_wrap(vars(week_days))
Yes, there is a pattern. On weekdays, most bike rentals happen during the morning and evening rush hours. On weekends, most rentals happen from midday to mid-afternoon.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises. Repeat the graphic from Exercise @ref(exr:exr-temp) (d) with the following changes:
Set the fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color = NA to suppress the outline of the density function.

Trips %>%
  mutate(time_hour = hour(sdate),
         time_min = minute(sdate),
         new_time = time_hour + time_min / 60,
         week_days = wday(sdate, label = TRUE)) %>%
  ggplot(aes(x = new_time, fill = client)) +
  geom_density(color = NA, alpha = 0.5) +
  labs(x = "Time of Day", y = "Density",
       title = "Distribution of Events by Time of Day") +
  facet_wrap(vars(week_days))
Now add position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?

Trips %>%
  mutate(time_hour = hour(sdate),
         time_min = minute(sdate),
         new_time = time_hour + time_min / 60,
         week_days = wday(sdate, label = TRUE)) %>%
  ggplot(aes(x = new_time, fill = client)) +
  geom_density(position = position_stack(), color = NA, alpha = 0.5) +
  labs(x = "Time of Day", y = "Density",
       title = "Distribution of Events by Time of Day") +
  facet_wrap(vars(week_days))
While the stacked graph in 12 is nicer to look at, the graph from 11 gives more information and is easier to interpret. In 12 it is hard to determine where each density starts and ends because they are stacked on top of each other. In 11 the distributions are not stacked, so with the transparency you can see where each one starts and ends, making them easier to read.
Create a new variable weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.

Trips %>%
  mutate(time_hour = hour(sdate),
         time_min = minute(sdate),
         new_time = time_hour + time_min / 60,
         week_days = wday(sdate, label = TRUE),
         the_weekend = ifelse(week_days %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>%
  ggplot(aes(x = new_time, fill = client)) +
  geom_density(color = NA, alpha = 0.5) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(x = "Time of Day", y = "Density",
       title = "Distribution of Events by Time of Day on Weekdays vs. Weekends",
       subtitle = "Based on Client Type") +
  theme_minimal() +
  facet_wrap(vars(the_weekend))
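One subtle point about the weekend condition: ifelse(week_days == c("Sat", "Sun"), ...) would not do what it looks like, because == recycles the two-element vector against the data rather than checking each value against both days; %in% is the operator that tests membership. A small illustration:

```r
days <- c("Sat", "Mon", "Sun", "Sat")

# == recycles c("Sat", "Sun"): it compares days[1] to "Sat",
# days[2] to "Sun", days[3] to "Sat", days[4] to "Sun"
days == c("Sat", "Sun")   # TRUE FALSE FALSE FALSE

# %in% tests each element against the whole set
days %in% c("Sat", "Sun") # TRUE FALSE TRUE TRUE
```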
Now facet on client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?

Trips %>%
  mutate(time_hour = hour(sdate),
         time_min = minute(sdate),
         new_time = time_hour + time_min / 60,
         week_days = wday(sdate, label = TRUE)) %>%
  ggplot(aes(x = new_time, fill = week_days)) +
  geom_density(color = NA, alpha = 0.5) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(x = "Time of Day", y = "Density",
       title = "Distribution of Events by Time of Day and Day of the Week",
       subtitle = "Faceted by Client Type") +
  theme_minimal() +
  facet_wrap(vars(client))
The graph from 13 is better. The graph for 14 is very difficult to read because it layers seven distributions in each facet, while the graph for 13 overlays only two, so it is much easier to interpret. The graph in 14 does carry more information, showing each individual day rather than just weekends versus weekdays, but the ease with which 13 is read makes up for the information it leaves out.
Use Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!

Stations %>%
  left_join(Trips,
            by = c("name" = "sstation")) %>%
  group_by(name, long, lat) %>%
  summarize(total_departures = n()) %>%
  ggplot(aes(x = long, y = lat, color = total_departures)) +
  geom_point() +
  labs(title = "Total Departures from Each Rental Location",
       x = "Longitude",
       y = "Latitude")
Stations %>%
  left_join(Trips,
            by = c("name" = "sstation")) %>%
  group_by(name, long, lat) %>%
  summarize(percent_casual = mean(client == "Casual")) %>%
  ggplot(aes(x = long, y = lat, color = percent_casual)) +
  geom_point() +
  labs(title = "Percent of Departures by Casual Users",
       x = "Longitude",
       y = "Latitude")
There are a lot of points along latitude 38.8, between longitudes -77.1 and -77.0. There also appears to be a diagonal pattern from latitude 39.0 down to 38.9. The dense cluster of points between -77.1 and -77.0 along latitude 38.8 is likely downtown D.C., perhaps near Metro stations.
HINT: as_date(sdate) converts sdate from date-time format to date format.

topten_trips <- Trips %>%
  mutate(tripdate = as_date(sdate)) %>%
  group_by(sstation, tripdate) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(10)

topten_trips
Trips %>%
  mutate(tripdate = as_date(sdate)) %>%
  semi_join(topten_trips,
            by = c("sstation", "tripdate"))
Trips %>%
  mutate(tripdate = as_date(sdate)) %>%
  semi_join(topten_trips,
            by = c("sstation", "tripdate")) %>%
  mutate(day_of_week = wday(sdate, label = TRUE)) %>%
  group_by(client, day_of_week) %>%
  summarize(num_riders = n()) %>%
  mutate(num_prop = num_riders / sum(num_riders)) %>%
  pivot_wider(id_cols = day_of_week,
              names_from = client,
              values_from = num_prop)
The table shows, for each type of client, the proportion of rides taken on each day of the week.
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
Recreate the graph, which uses facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.

kids %>%
  pivot_wider(id_cols = c(state, year),
              names_from = variable,
              values_from = inf_adj) %>%
  filter(year %in% c(1997, 2016)) %>%
  ggplot(aes(x = year, y = lib)) +
  geom_line() +
  # I wanted to highlight the two states that lowered their spending
  # (Maryland and Mississippi) with gghighlight(), but could not figure it out.
  facet_geo(~state,
            scales = "free")
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?